Weird bug in TabularDataset.column_names #589

Closed
Ark-kun opened this issue Aug 3, 2021 · 0 comments · Fixed by #590
Labels: api: aiplatform (Issues related to the AI Platform API), triage me (I really want to be triaged)

Comments

@Ark-kun
Contributor

Ark-kun commented Aug 3, 2021

I've run into a very weird issue.
I imported the well-known Retail Stockout prediction dataset (in CSV format) into Vertex AI Datasets using the google.cloud.aiplatform.TabularDataset Python library.

Most columns have the "Wk_" prefix. The screenshot shows that there is only one column whose name contains "2016_43_Quantity": the "Wk_2016_43_Quantity" column, just like in the source CSV.
Everything looks fine there.

But here is the problem:
When I call the API to get the dataset metadata, including the column names, all of the names are correct except one, which comes back as "WWk_2016_43_Quantity" (notice the doubled "W" in the "WWk_" prefix).
In context:
...
'Wk_2016_42_Quantity',
'WWk_2016_43_Quantity',
'Wk_2016_44_Quantity',
...

This discrepancy causes the subsequent model training to fail, because the dataset has no WWk_2016_43_Quantity column (it has Wk_2016_43_Quantity instead).

I do not understand how this could have happened, but you can easily examine the imported dataset and see that what the UI shows and what the google-cloud-aiplatform library returns differ.

Environment details

  • OS type and version: Linux
  • Python version: 3.7
  • google-cloud-aiplatform version: 1.1.1

Steps to reproduce

  1. Create a dataset from the "gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv" file (a hedged sketch of this step follows the list)
  2. Try getting its columns
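
For step 1, the dataset can be created with something like the following. This is a hedged sketch, not the reporter's exact code: the project ID, location, and display name are placeholders; only the GCS path comes from the report.

```python
from google.cloud import aiplatform

# Placeholder project and location; substitute your own.
aiplatform.init(project="your-project-id", location="us-central1")

# Import the public stockout CSV into a Vertex AI tabular dataset.
dataset = aiplatform.TabularDataset.create(
    display_name="stockout",  # arbitrary display name
    gcs_source="gs://kubeflow-pipelines-regional-us-central1/mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv",
)
print(dataset.resource_name)
```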

Code example

```python
from google.cloud import aiplatform

dataset = aiplatform.TabularDataset(
    'projects/140626129697/locations/us-central1/datasets/2405036550225133568'
)
print(dataset.column_names)
```
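
To confirm the discrepancy independently (an illustrative sketch, not part of the original report), you can read the raw CSV header straight from GCS with google-cloud-storage and compare it with what `column_names` returns. This assumes your credentials can read the mirror bucket and that the header fits in the first 64 KiB:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("kubeflow-pipelines-regional-us-central1")
blob = bucket.blob("mirror/cloud-ml-data/automl-tables/notebooks/stockout.csv")

# Note: the `end` parameter of download_as_bytes is an *inclusive* byte offset.
# 64 KiB is assumed to be enough to cover the header line.
first_bytes = blob.download_as_bytes(start=0, end=64 * 1024 - 1)
header = first_bytes.split(b"\n", 1)[0].decode("utf-8")
csv_columns = header.split(",")

print("Wk_2016_43_Quantity" in csv_columns)   # expected True: the source header is correct
print("WWk_2016_43_Quantity" in csv_columns)  # expected False: the doubled prefix is not in the CSV
```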
product-auto-label bot added the api: aiplatform label Aug 3, 2021
Ark-kun added a commit to Ark-kun/python-aiplatform that referenced this issue Aug 3, 2021
Fixes googleapis#589

The `end` parameter of the `blob.download_as_bytes` function is inclusive, not exclusive.

> There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.
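
To illustrate why an inclusive `end` offset, treated as exclusive, produces the doubled prefix: each chunk re-reads the byte that starts the next chunk, so the byte sitting on a chunk boundary (here a "W") appears twice when the chunks are concatenated. A minimal self-contained sketch, not the library's actual code:

```python
# Simulated header; the chunk size is hypothetical, chosen so a boundary lands on a "W".
header = b"Wk_2016_42_Quantity,Wk_2016_43_Quantity,Wk_2016_44_Quantity\n"
chunk_size = 20


def fake_download(data: bytes, start: int, end: int) -> bytes:
    """Mimics blob.download_as_bytes: the `end` offset is inclusive."""
    return data[start:end + 1]


# Buggy loop: end=start+chunk_size re-reads the first byte of the next chunk.
buggy = b"".join(
    fake_download(header, start, start + chunk_size)
    for start in range(0, len(header), chunk_size)
)

# Fixed loop: end=start+chunk_size-1 yields non-overlapping chunks.
fixed = b"".join(
    fake_download(header, start, start + chunk_size - 1)
    for start in range(0, len(header), chunk_size)
)

print(buggy)            # b'Wk_2016_42_Quantity,WWk_2016_43_Quantity,WWk_2016_44_Quantity\n'
print(fixed == header)  # True
```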
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Aug 3, 2021
yoshi-automation added the triage me label Aug 4, 2021
sasha-gitg pushed a commit that referenced this issue Aug 4, 2021
Ark-kun added a commit to Ark-kun/pipeline_components that referenced this issue Oct 25, 2021