Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Pandas DataFrame support to TabularDataset #1185

Merged
merged 12 commits into from
May 5, 2022

Conversation

sararob
Copy link
Contributor

@sararob sararob commented Apr 26, 2022

Adds support for uploading a local Pandas DataFrame as a Vertex TabularDataset via a create_from_dataframe method on the TabularDataset class.

Also added relevant tests to test_datasets.py

Fixes b/189369695 🦕

@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Apr 26, 2022
google/cloud/aiplatform/datasets/tabular_dataset.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/datasets/tabular_dataset.py Outdated Show resolved Hide resolved
df_source (pd.DataFrame):
Required. Pandas DataFrame containing the source data for
ingestion as a TabularDataset.
staging_path (str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the location requirements should also be documented or a reference to this documentation should be provided: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations

If possible they should be validated but not a hard requirement. Is it possible for the dataset create to fail because of the regional requirements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I just tested it and it can fail if the dataset location doesn't match the project location or the service doesn't have the right access to the dataset. I'll update the docstring to link to that page.

In terms of validating, the BQ client throws this error: google.api_core.exceptions.FailedPrecondition: 400 BigQuery Dataset location eu must be in the same location as the service location us-central1.

Do you think we should validate as well or let the BQ client handle validation? If we do validation, we'd need to use the BQ client to check the location of the provided BQ dataset string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with relying on BQ client.

parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this config infer all the types? I see the enable_list_inference but I couldn't find a reference in the BQ docs for non list type inference.

Copy link
Contributor Author

@sararob sararob Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will infer the data types from the DF. From the BQ docs:

            The destination table to use for loading the data. If it is an
            existing table, the schema of the :class:`~pandas.DataFrame`
            must match the schema of the destination table. If the table
            does not yet exist, the schema is inferred from the
            :class:`~pandas.DataFrame`.

I added a bq_schema param to give the user the option to override the data type autodection, but I think I may need to add more client-side validation on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rely on BQ client validation.

tests/unit/aiplatform/test_datasets.py Outdated Show resolved Hide resolved
tests/unit/aiplatform/test_datasets.py Outdated Show resolved Hide resolved
tests/unit/aiplatform/test_datasets.py Show resolved Hide resolved
tests/unit/aiplatform/test_datasets.py Outdated Show resolved Hide resolved
@@ -147,6 +148,30 @@

_TEST_LABELS = {"my_key": "my_value"}

# create_from_dataframe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preference for an integration test as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 2 integration tests: one with the bq_schema param and one without.

@sararob sararob requested a review from a team as a code owner April 27, 2022 23:18
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Apr 27, 2022
@sararob sararob requested a review from sasha-gitg April 28, 2022 13:17
df_source (pd.DataFrame):
Required. Pandas DataFrame containing the source data for
ingestion as a TabularDataset.
staging_path (str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with relying on BQ client.

@@ -19,12 +19,19 @@

from google.auth import credentials as auth_credentials

from google.cloud import bigquery
from google.cloud.bigquery import _pandas_helpers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't rely on a private module from a dependent python package because it may break between minor versions.

Required. The user-provided BigQuery schema.
"""
try:
_pandas_helpers.dataframe_to_arrow(dataframe, schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the BQ client handles this validation we can rely on the client and avoid importing this private module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this validation check and letting the BQ client handle it.

parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rely on BQ client validation.

dataset.location = _TEST_LOCATION
shared_state["bigquery_dataset"] = bigquery_client.create_dataset(dataset)

yield
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like BQ dataset deletion should be handled after the yield.


aiplatform.init(project=_TEST_PROJECT, location=_TEST_LOCATION)

tabular_dataset = aiplatform.TabularDataset.create_from_dataframe(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabular_dataset should be appended to shared_state['resources'] so it's deleted after the test

see example: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/test_e2e_tabular.py#L89

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated both new tests to use shared_state['resources']

bigquery.SchemaField(name="string_array_col", field_type="STRING", mode="REPEATED"),
bigquery.SchemaField(name="bytes_col", field_type="STRING"),
]


class TestDataset:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should enherit from the base e2e class so it can take advantage of the resource cleanup functionality: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/e2e_base.py#L37

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to inherit from e2e_base. I noticed there was some duplication of other fixtures in this file so I removed that and inherited the fixtures from the base class.

@sararob sararob requested a review from sasha-gitg May 3, 2022 02:01
bq_schema (Optional[Union[str, bigquery.SchemaField]]):
Optional. The schema to use when creating the staging table in BigQuery. For more details,
see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
This is not needed if the BigQuery table provided in `staging_path` already exists.
Copy link

@helinwang helinwang May 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply when the staging table does not exit, user need to pass in BQ schema? No existing staging table is the most common case, and this would be a bit hard to use - users need to learn how to create BQ schema, etc.
Given df_source already has a schema, can we auto generate the BQ schema for users?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best case is BQ client support inferencing schema natively, could you check on that? BQ UI supports inferencing schema for JSONL. Haven't tried Parquet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BQ client will autodetect the schema using the DataFrame column types if the user doesn't provide this bq_schema parameter, this happens here. So I'm guessing most users won't use this schema param to override the autodetect.

I think my docstring was confusing. I updated it, let me know if this is clearer:

Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types.
If set, BigQuery will use the schema you provide when creating the staging table. For more details,
see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema

@sararob sararob requested a review from helinwang May 4, 2022 14:23
@product-auto-label product-auto-label bot added the api: vertex-ai Issues related to the googleapis/python-aiplatform API. label May 5, 2022
@sararob sararob merged commit 4fe4558 into googleapis:main May 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: vertex-ai Issues related to the googleapis/python-aiplatform API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants