Weird bug in TabularDataset.column_names #589

Closed
Ark-kun opened this issue Aug 3, 2021 · 0 comments · Fixed by #590
Labels: api: aiplatform (Issues related to the AI Platform API), triage me (I really want to be triaged)

Comments

@Ark-kun
Contributor

Ark-kun commented Aug 3, 2021

I've run into a very weird issue.
I imported the well-known Retail Stockout prediction dataset (in CSV format) into Vertex AI Datasets using the google.cloud.aiplatform.TabularDataset Python library.

Most columns have the "Wk_" prefix. The screenshot shows that there is only one column whose name contains "2016_43_Quantity": the "Wk_2016_43_Quantity" column, just like in the source CSV.
Everything looks fine there.

But here is the problem:
When I call the API to get the dataset metadata, including the column names, all of the names are correct except one, which comes back as "WWk_2016_43_Quantity" (notice the doubled "W" in the "WWk_" prefix).
In context:
...
'Wk_2016_42_Quantity',
'WWk_2016_43_Quantity',
'Wk_2016_44_Quantity',
...

This discrepancy causes the subsequent model training to fail, because the dataset has no WWk_2016_43_Quantity column (it has Wk_2016_43_Quantity instead).

I do not understand how this could have happened, but you can easily examine the imported dataset and see that what the UI shows and what the google-cloud-aiplatform library returns differ.

Environment details

  • OS type and version: Linux
  • Python version: 3.7
  • google-cloud-aiplatform version: 1.1.1

Steps to reproduce

  1. Create a dataset from the "gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv" file (a hedged sketch of this step follows the list)
  2. Try getting its columns
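
For step 1, the dataset can be created with something like the following. This is a hedged sketch, not the reporter's exact code: the project ID, location, and display name are placeholders; only the GCS path comes from the report.

```python
from google.cloud import aiplatform

# Placeholder project and location; substitute your own.
aiplatform.init(project="your-project-id", location="us-central1")

# Import the public stockout CSV into a Vertex AI tabular dataset.
dataset = aiplatform.TabularDataset.create(
    display_name="stockout",  # arbitrary display name
    gcs_source="gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv",
)
print(dataset.resource_name)
```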

Code example

```python
from google.cloud import aiplatform

dataset = aiplatform.TabularDataset(
    'projects/140626129697/locations/us-central1/datasets/2405036550225133568'
)
print(dataset.column_names)
```
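
To confirm the discrepancy independently (an illustrative sketch, not part of the original report), you can read the raw CSV header straight from GCS with google-cloud-storage and compare it with what `column_names` returns. This assumes your credentials can read the mirror bucket and that the header fits in the first 64 KiB:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("kubeflow-pipelines-regional-us-central1")
blob = bucket.blob("mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv")

# Note: the `end` parameter of download_as_bytes is an *inclusive* byte offset.
# 64 KiB is assumed to be enough to cover the header line.
first_bytes = blob.download_as_bytes(start=0, end=64 * 1024 - 1)
header = first_bytes.split(b"\n", 1)[0].decode("utf-8")
csv_columns = header.split(",")

print("Wk_2016_43_Quantity" in csv_columns)   # expected True: the source header is correct
print("WWk_2016_43_Quantity" in csv_columns)  # expected False: the doubled prefix is not in the CSV
```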
product-auto-label bot added the api: aiplatform label Aug 3, 2021
Ark-kun added a commit to Ark-kun/python-aiplatform that referenced this issue Aug 3, 2021
Fixes googleapis#589

The `end` parameter of the `blob.download_as_bytes` function is inclusive, not exclusive.

> There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
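
To illustrate why an inclusive `end` offset, treated as exclusive, produces the doubled prefix: each chunk re-reads the byte that starts the next chunk, so the byte sitting on a chunk boundary (here a "W") appears twice when the chunks are concatenated. A minimal self-contained sketch, not the library's actual code:

```python
# Simulated header; the chunk size is hypothetical, chosen so a boundary lands on a "W".
header = b"Wk_2016_42_Quantity,Wk_2016_43_Quantity,Wk_2016_44_Quantity\n"
chunk_size = 20


def fake_download(data: bytes, start: int, end: int) -> bytes:
    """Mimics blob.download_as_bytes: the `end` offset is inclusive."""
    return data[start:end + 1]


# Buggy loop: end=start+chunk_size re-reads the first byte of the next chunk.
buggy = b"".join(
    fake_download(header, start, start + chunk_size)
    for start in range(0, len(header), chunk_size)
)

# Fixed loop: end=start+chunk_size-1 yields non-overlapping chunks.
fixed = b"".join(
    fake_download(header, start, start + chunk_size - 1)
    for start in range(0, len(header), chunk_size)
)

print(buggy)            # b'Wk_2016_42_Quantity,WWk_2016_43_Quantity,WWk_2016_44_Quantity\n'
print(fixed == header)  # True
```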
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 3, 2021
yoshi-automation added the triage me label Aug 4, 2021
sasha-gitg pushed a commit that referenced this issue Aug 4, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Oct 25, 2021