feat: add Pandas DataFrame support to TabularDataset #1185

sararob · 2022-04-26T21:59:06Z

Adds support for uploading a local Pandas DataFrame as a Vertex TabularDataset via a create_from_dataframe method on the TabularDataset class.

Also adds relevant tests to test_datasets.py.

Fixes b/189369695 🦕
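
For illustration, a call to the new method might look like the sketch below. The `df_source`, `staging_path`, and `display_name` parameter names come from this PR. The `bq://project.dataset.table` form of the staging path and the helper names `staging_uri` and `upload_dataframe` are assumptions made for the example. The SDK import is deferred so the snippet can be loaded without the SDK or credentials.

```python
def staging_uri(project, dataset, table):
    """Build a BigQuery staging path (assumed format: bq://project.dataset.table)."""
    return f"bq://{project}.{dataset}.{table}"


def upload_dataframe(df, staging_path, display_name=None):
    """Create a Vertex AI TabularDataset from an in-memory pandas DataFrame.

    Hypothetical wrapper: calling it requires google-cloud-aiplatform, pandas,
    and credentials with access to the BigQuery staging dataset.
    """
    # Imported lazily so that defining this function needs no SDK install.
    from google.cloud import aiplatform

    return aiplatform.TabularDataset.create_from_dataframe(
        df_source=df,
        staging_path=staging_path,
        display_name=display_name,
    )
```

After `aiplatform.init(project=..., location=...)`, a caller would pass a DataFrame and something like `staging_uri("my-project", "my_dataset", "my_table")`; those project, dataset, and table names are placeholders.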

sasha-gitg · 2022-04-27T13:54:57Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+            df_source (pd.DataFrame):
+                Required. Pandas DataFrame containing the source data for
+                ingestion as a TabularDataset.
+            staging_path (str):


It seems the location requirements should also be documented or a reference to this documentation should be provided: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations

If possible, the locations should be validated, though that is not a hard requirement. Is it possible for the dataset creation to fail because of the regional requirements?

Good point, I just tested it and it can fail if the dataset location doesn't match the project location or the service doesn't have the right access to the dataset. I'll update the docstring to link to that page.

In terms of validating, the BQ client throws this error: google.api_core.exceptions.FailedPrecondition: 400 BigQuery Dataset location eu must be in the same location as the service location us-central1.

Do you think we should validate as well or let the BQ client handle validation? If we do validation, we'd need to use the BQ client to check the location of the provided BQ dataset string.

Agree with relying on BQ client.
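
Since validation is left to the BQ client, a caller who wants a friendlier message could inspect the error text quoted above. The following is only a sketch: the helper name `parse_location_mismatch` is hypothetical, and it assumes the message keeps the exact wording shown in this thread, which the service does not guarantee.

```python
import re

# Pattern based on the error text quoted in this thread; the wording is an
# assumption and may change between service versions.
_LOCATION_MISMATCH = re.compile(
    r"BigQuery Dataset location (?P<dataset>\S+) must be in the same location "
    r"as the service location (?P<service>[^\s.]+)"
)


def parse_location_mismatch(message):
    """Return (dataset_location, service_location), or None if no match."""
    match = _LOCATION_MISMATCH.search(message)
    if match is None:
        return None
    return match.group("dataset"), match.group("service")
```

Matching on message text is fragile, so this is better suited to logging or user hints than to control flow.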

sasha-gitg · 2022-04-27T13:56:14Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+            parquet_options = bigquery.format_options.ParquetOptions()
+            parquet_options.enable_list_inference = True
+
+            job_config = bigquery.LoadJobConfig(


Will this config infer all the types? I see enable_list_inference, but I couldn't find a reference in the BQ docs for non-list type inference.

Yes, this will infer the data types from the DF. From the BQ docs:

The destination table to use for loading the data. If it is an existing table, the schema of the :class:`~pandas.DataFrame` must match the schema of the destination table. If the table does not yet exist, the schema is inferred from the :class:`~pandas.DataFrame`.

I added a bq_schema param to give the user the option to override the data type autodetection, but I think I may need to add more client-side validation on that.

I think we can rely on BQ client validation.
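
To make the discussion concrete, here is a rough sketch of how such a load job could be assembled. Only the `ParquetOptions` lines and the `LoadJobConfig` call appear in the diff excerpt. The `source_format` argument, the `load_table_from_dataframe` call, and the function names are assumptions filled in for illustration, not necessarily what the PR does.

```python
def build_parquet_load_config(bq_schema=None):
    """Build a LoadJobConfig with list inference enabled.

    When bq_schema is None, BigQuery infers the schema from the DataFrame.
    """
    # Lazy import: requires google-cloud-bigquery only when called.
    from google.cloud import bigquery

    parquet_options = bigquery.format_options.ParquetOptions()
    parquet_options.enable_list_inference = True

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        parquet_options=parquet_options,
    )
    if bq_schema is not None:
        # A user-provided schema overrides autodetection.
        job_config.schema = bq_schema
    return job_config


def load_dataframe(client, df, table_id, bq_schema=None):
    """Load df into table_id and block until the load job finishes."""
    job = client.load_table_from_dataframe(
        df, table_id, job_config=build_parquet_load_config(bq_schema)
    )
    return job.result()
```

Any schema mismatch is surfaced by the BigQuery client when the job runs, which matches the decision in this thread to skip client-side validation.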

sasha-gitg · 2022-04-27T14:05:02Z

tests/unit/aiplatform/test_datasets.py

@@ -147,6 +148,30 @@

 _TEST_LABELS = {"my_key": "my_value"}

+# create_from_dataframe


Preference for an integration test as well.

Added 2 integration tests: one with the bq_schema param and one without.

sasha-gitg · 2022-05-02T19:53:03Z

google/cloud/aiplatform/datasets/tabular_dataset.py

@@ -19,12 +19,19 @@

 from google.auth import credentials as auth_credentials

+from google.cloud import bigquery
+from google.cloud.bigquery import _pandas_helpers


We shouldn't rely on a private module from a Python package we depend on, because it may break between minor versions.

sasha-gitg · 2022-05-02T19:55:02Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+                Required. The user-provided BigQuery schema.
+        """
+        try:
+            _pandas_helpers.dataframe_to_arrow(dataframe, schema)


If the BQ client handles this validation we can rely on the client and avoid importing this private module.

Removed this validation check; the BQ client now handles it.

sasha-gitg · 2022-05-02T19:58:05Z

tests/system/aiplatform/test_dataset.py

+        dataset.location = _TEST_LOCATION
+        shared_state["bigquery_dataset"] = bigquery_client.create_dataset(dataset)
+
+        yield


It seems like BQ dataset deletion should be handled after the yield.
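
The suggested shape is setup before the `yield` and teardown after it. The sketch below shows that shape as a plain generator so it runs without pytest; in the test suite it would presumably carry a `@pytest.fixture` decorator. `RecordingClient` is a hypothetical stand-in for `bigquery.Client`, and the function name is made up for the example.

```python
class RecordingClient:
    """Hypothetical stand-in for bigquery.Client that tracks live datasets."""

    def __init__(self):
        self.datasets = set()

    def create_dataset(self, dataset_id):
        self.datasets.add(dataset_id)
        return dataset_id

    def delete_dataset(self, dataset_id, delete_contents=False):
        self.datasets.discard(dataset_id)


def bigquery_dataset(client, dataset_id, shared_state):
    """Generator-style fixture body: create, yield to the test, then delete."""
    shared_state["bigquery_dataset"] = client.create_dataset(dataset_id)
    try:
        yield
    finally:
        # Teardown runs after the test body, including when it raised.
        client.delete_dataset(
            shared_state["bigquery_dataset"], delete_contents=True
        )
```

Under pytest, code after the `yield` in a fixture already runs as teardown; the `try`/`finally` only matters when the generator is driven by hand as in this sketch.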

sasha-gitg · 2022-05-02T19:58:42Z

tests/system/aiplatform/test_dataset.py

+
+        aiplatform.init(project=_TEST_PROJECT, location=_TEST_LOCATION)
+
+        tabular_dataset = aiplatform.TabularDataset.create_from_dataframe(


tabular_dataset should be appended to shared_state['resources'] so it's deleted after the test

see example: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/test_e2e_tabular.py#L89

Updated both new tests to use shared_state['resources']

sasha-gitg · 2022-05-02T20:00:25Z

tests/system/aiplatform/test_dataset.py

+    bigquery.SchemaField(name="string_array_col", field_type="STRING", mode="REPEATED"),
+    bigquery.SchemaField(name="bytes_col", field_type="STRING"),
+]
+

 class TestDataset:


This should inherit from the base e2e class so it can take advantage of the resource cleanup functionality: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/e2e_base.py#L37

Updated to inherit from e2e_base. I noticed there was some duplication of other fixtures in this file so I removed that and inherited the fixtures from the base class.
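
The cleanup pattern referred to here can be sketched roughly as follows. This is a simplified illustration of the idea, not the code in e2e_base.py: `FakeResource` and `tear_down_resources` are hypothetical names, and the only assumption about real resources is that they expose a `delete()` method.

```python
class FakeResource:
    """Hypothetical stand-in for a Vertex AI resource exposing delete()."""

    def __init__(self, name):
        self.name = name
        self.deleted = False

    def delete(self):
        self.deleted = True


def tear_down_resources(shared_state):
    """Delete every registered resource; collect failures instead of raising."""
    errors = []
    for resource in shared_state.get("resources", []):
        try:
            resource.delete()
        except Exception as exc:  # keep cleaning up the remaining resources
            errors.append((resource, exc))
    return errors
```

A test would register what it creates with `shared_state["resources"].append(tabular_dataset)`, and the shared teardown then removes it whether or not the test passed.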

helinwang · 2022-05-04T04:13:10Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+            bq_schema (Optional[Union[str, bigquery.SchemaField]]):
+                Optional. The schema to use when creating the staging table in BigQuery. For more details,
+                see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
+                This is not needed if the BigQuery table provided in `staging_path` already exists.


Does this imply that when the staging table does not exist, users need to pass in a BQ schema? Having no existing staging table is the most common case, and this would be a bit hard to use: users would need to learn how to create a BQ schema, etc.
Given that df_source already has a schema, can we auto-generate the BQ schema for users?

The best case is that the BQ client supports inferring the schema natively; could you check on that? The BQ UI supports schema inference for JSONL. I haven't tried Parquet.

The BQ client will autodetect the schema using the DataFrame column types if the user doesn't provide this bq_schema parameter; this happens here. So I'm guessing most users won't use this schema param to override the autodetection.

I think my docstring was confusing. I updated it, let me know if this is clearer:

Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types. If set, BigQuery will use the schema you provide when creating the staging table. For more details, see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
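
For users who do want to override autodetection, a schema could be built along these lines. This is a sketch: `string_array_col` echoes the system test in this PR, while `string_col` and the function name `example_bq_schema` are made up for illustration.

```python
def example_bq_schema():
    """Return a list of SchemaField objects suitable for the bq_schema argument."""
    # Lazy import: requires google-cloud-bigquery only when called.
    from google.cloud import bigquery

    return [
        # Hypothetical scalar column.
        bigquery.SchemaField(name="string_col", field_type="STRING"),
        # A list-valued DataFrame column maps to a REPEATED field.
        bigquery.SchemaField(
            name="string_array_col", field_type="STRING", mode="REPEATED"
        ),
    ]
```

The result would be passed as `bq_schema=example_bq_schema()` to `create_from_dataframe`; leaving `bq_schema` unset keeps the autodetected schema.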

sararob added 3 commits April 26, 2022 16:13

add create_from_dataframe method

f5785dd

add tests for create_from_dataframe

ae01b66

update docstrings and run linter

e2de699

product-auto-label bot added the size: m Pull request size is medium. label Apr 26, 2022

sararob requested review from rosiezou, sasha-gitg and helinwang April 26, 2022 22:00

update docstrings and make display_name optional

1983471

sasha-gitg requested changes Apr 27, 2022

updates from sashas feedback: added integration test, update validations

80b2ef4

sararob requested a review from a team as a code owner April 27, 2022 23:18

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Apr 27, 2022

sararob requested a review from sasha-gitg April 28, 2022 13:17

sararob and others added 3 commits April 28, 2022 09:23

remove some logging

19d8565

update error handling on bq_schema arg

0f81b3d

Merge branch 'main' into sr-pandas-tabular-dataset

c6b4ed9

sasha-gitg reviewed May 2, 2022

updates from sashas feedback

833d9f5

sararob requested a review from sasha-gitg May 3, 2022 02:01

sasha-gitg approved these changes May 3, 2022

helinwang reviewed May 4, 2022

update bq_schema docstring

87ac7f9

sararob requested a review from helinwang May 4, 2022 14:23

Merge branch 'main' into sr-pandas-tabular-dataset

791ca84

product-auto-label bot added the api: vertex-ai Issues related to the googleapis/python-aiplatform API. label May 5, 2022

Merge branch 'main' into sr-pandas-tabular-dataset

a43b923

sararob merged commit 4fe4558 into googleapis:main May 5, 2022

release-please bot mentioned this pull request May 5, 2022

chore(main): release 1.13.0 #1181

Merged

release-please bot mentioned this pull request Jun 8, 2023

chore(main): release 1.24.1 #2196

Closed
