
Wildcard path(s) for the GCS tabular dataset are not working #957

Closed
glebrh opened this issue Jan 17, 2022 · 3 comments · Fixed by #1220
Labels: aiplatform (Issues related to the AI Platform (Unified) API), api: vertex-ai (Issues related to the googleapis/python-aiplatform API), priority: p2 (Moderately-important priority; fix may not be included in next release), type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments


glebrh commented Jan 17, 2022

Dear team,

It seems that, despite being mentioned in the docstring and documentation of TabularDataset, wildcard paths for tabular GCS datasets are not supported. This affects both metadata parsing (e.g. viewing column names) and model training with such a dataset.

See the examples below.

Environment details

  • OS: Windows 10
  • Python version: 3.8.12
  • google-cloud-aiplatform version: 1.8.1

Steps to reproduce

  1. Create a dataset with a wildcard path pointing to a GCS bucket containing CSV files
  2. Try to view the column names of that dataset
  3. Try to run a custom training job using the dataset

Code example

# 1. Create a dataset with a wildcard GCS path
ds = aiplatform.TabularDataset.create(
    display_name="tabular-dataset-test",
    gcs_source='gs://<bucketname>/<path-to-files>/*',
    sync=True,
)

# 2. Viewing column names fails (metadata parsing)
print(ds.column_names)

# 3. Training with the dataset also fails
job = aiplatform.CustomTrainingJob(...)
model = job.run(ds, sync=True)

Stack trace

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in run(self, dataset, annotation_schema_uri, model_display_name, model_labels, base_output_dir, service_account, network, bigquery_destination, args, environment_variables, replica_count, machine_type, accelerator_type, accelerator_count, boot_disk_type, boot_disk_size_gb, reduction_server_replica_count, reduction_server_machine_type, reduction_server_container_uri, training_fraction_split, validation_fraction_split, test_fraction_split, training_filter_split, validation_filter_split, test_filter_split, predefined_split_column_name, timestamp_split_column_name, enable_web_access, tensorboard, sync)
   2066         )
   2067 
-> 2068         return self._run(
   2069             python_packager=python_packager,
   2070             dataset=dataset,

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\base.py in wrapper(*args, **kwargs)
    673                 if self:
    674                     VertexAiResourceNounWithFutureManager.wait(self)
--> 675                 return method(*args, **kwargs)
    676 
    677             # callbacks to call within the Future (in same Thread)

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in _run(self, python_packager, dataset, annotation_schema_uri, worker_pool_specs, managed_model, args, environment_variables, base_output_dir, service_account, network, bigquery_destination, training_fraction_split, validation_fraction_split, test_fraction_split, training_filter_split, validation_filter_split, test_filter_split, predefined_split_column_name, timestamp_split_column_name, enable_web_access, tensorboard, reduction_server_container_uri, sync)
   2319         )
   2320 
-> 2321         model = self._run_job(
   2322             training_task_definition=schema.training_job.definition.custom_task,
   2323             training_task_inputs=training_task_inputs,

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in _run_job(self, training_task_definition, training_task_inputs, dataset, training_fraction_split, validation_fraction_split, test_fraction_split, training_filter_split, validation_filter_split, test_filter_split, predefined_split_column_name, timestamp_split_column_name, annotation_schema_uri, model, gcs_destination_uri_prefix, bigquery_destination)
    749         _LOGGER.info("View Training:\n%s" % self._dashboard_uri())
    750 
--> 751         model = self._get_model()
    752 
    753         if model is None:

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in _get_model(self)
    836             RuntimeError: If Training failed.
    837         """
--> 838         self._block_until_complete()
    839 
    840         if self.has_failed:

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in _block_until_complete(self)
    886             time.sleep(wait)
    887 
--> 888         self._raise_failure()
    889 
    890         _LOGGER.log_action_completed_against_resource("run", "completed", self)

~\.conda\envs\vertex-sample\lib\site-packages\google\cloud\aiplatform\training_jobs.py in _raise_failure(self)
    903 
    904         if self._gca_resource.error.code != code_pb2.OK:
--> 905             raise RuntimeError("Training failed with:\n%s" % self._gca_resource.error)
    906 
    907     @property

RuntimeError: Training failed with:
code: 5
message: "Google Cloud Storage file(s) not found: [gs://<bucketname>/<path-to-files>/*]"
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 18, 2022
@busunkim96 busunkim96 added api: aiplatform Issues related to the AI Platform API. aiplatform Issues related to the AI Platform (Unified) API. and removed triage me I really want to be triaged. labels Jan 18, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Jan 18, 2022
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jan 22, 2022
@morgandu morgandu assigned ivanmkc and unassigned sasha-gitg Jan 24, 2022
@kweinmeister kweinmeister added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Jan 26, 2022
@ivanmkc ivanmkc added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jan 27, 2022

ivanmkc commented Jan 27, 2022

Taking a look at this.


ivanmkc commented Jan 28, 2022

After some testing, I believe that wildcards are not supported. Please enumerate the files you need beforehand and pass them in without wildcards.

Thanks for bringing this to our attention. We'll correct the documentation to reflect this.
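
A sketch of the suggested workaround: enumerate the matching objects first, then pass explicit URIs (no wildcards) to `TabularDataset.create`. The helper below is not part of the library; the bucket name, prefix, and object names are hypothetical. In practice, the object names would come from `google.cloud.storage.Client().list_blobs(bucket, prefix=...)`.

```python
from fnmatch import fnmatch


def expand_gcs_wildcard(blob_names, bucket, pattern):
    """Filter object names against a wildcard pattern and return explicit gs:// URIs.

    blob_names: iterable of object names within the bucket, e.g. obtained via
        [b.name for b in google.cloud.storage.Client().list_blobs(bucket)]
    bucket: bucket name without the gs:// scheme
    pattern: a glob-style pattern such as "data/*.csv"
    """
    return [
        f"gs://{bucket}/{name}"
        for name in blob_names
        if fnmatch(name, pattern)
    ]


# Hypothetical example: resolve "data/*.csv" to explicit URIs, then pass
# the resulting list as gcs_source instead of a wildcard path.
names = ["data/part-1.csv", "data/part-2.csv", "data/readme.txt"]
uris = expand_gcs_wildcard(names, "my-bucket", "data/*.csv")
# uris == ["gs://my-bucket/data/part-1.csv", "gs://my-bucket/data/part-2.csv"]
```

The explicit list can then be supplied directly, e.g. `aiplatform.TabularDataset.create(display_name=..., gcs_source=uris)`, since `gcs_source` accepts a sequence of URIs.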

@ivanmkc ivanmkc closed this as completed Jan 28, 2022
kawofong commented
When can we expect the documentation to be updated? I recently ran into this issue and wasted hours of debugging before realizing that this is a known problem with the documentation.

@ivanmkc ivanmkc reopened this May 11, 2022
@product-auto-label product-auto-label bot added api: vertex-ai Issues related to the googleapis/python-aiplatform API. and removed api: aiplatform Issues related to the AI Platform API. labels May 11, 2022

7 participants